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This  paper  describes  the  implementation  of  three  standard  matrix 
eigenvalue  computation  methods  on  an  array  machine  with  high  efficiency.  A 

description  of  the  ILLIAC  pW computer  is  provided  as  background  material. 
Three  major  sections  follow— the  first  two  describe  Jacobi  and  Householder 
algorithms  for  real  symmetric  matrices,  and  the  third  describes  the  QR  algorithm 
for  real  nonsymmetric  matrices.  Each  of  these  sections  is  divided  into  four 
parts.  The  theoretical  background" of  the  method  is  presented  first.  An  ILLIAC 
implementation  of  the  algorithm  is  presented  and  timing  estimates  are  included. 
Next,  the  efficiency  (ratio  of  number  of  computations  actually  performed  to  number 
computations  possible  in  a  given  time)  of  the  computation  on  an  array  machine  is 
derived  for  primary  memory  contained  matrices.  Finally,  eigenvalue  computations 
for  matrices  too  large  to  be  contained  in  primary  memory  are  discussed  in  terms 
of  a  head-per-track  secondary  storage  device.  It  is  shown  that  for  Jacobi  and 
Householder  algorithms,  parallel  computation  efficiencies  of  90 %  are  possible, 
while  those  of  the  QR  algorithm  are  rather  low. 


ABSTRACT 


This  paper  describes  the  implementation  of  three  standard  matrix 
eigenvalue  computation  methods  on  an  array  machine  with  high  efficiency. 

A  brief  description  of  the  ILLIAC  IV  computer  is  provided  as  background 
material.  Three  major  sections  follow — the  first  two  describe  Jacobi  and 
Householder  algorithms  for  real  symmetric  matrices,  and  the  third  describes 
the  -,'R  algorithm  for  real  nonsymmetric  matrices.  Each  of  these  sections  is 
divided  into  four  parts.  The  theoretical  background  of  the  method  is  pre¬ 
sented  first.  An  ILLIAC  IV  implementation  of  the  algorithm  is  presented  and 
timing  estimates  are  included.  Next,  the  efficiency  (ratio  of  number  of 
computations  actually  performed  to  number  of  computations  possible  in  a 
giver,  time)  of  the  computation  on  an  array  machine  is  derived  for  primary 
memory  contained  matrices.  Finally,  eigenvalue  computations  for  matrices  too 
largo  to  be  contained  in  primary  memory  are  discussed  in  terms  of  a  head-per- 
track  secondary  storage  device. 

It  is  shown  that  for  Jacobi  and  Householder  algorithms,  parallel 
computation  efficiencies  of  90%  are  possible,  while  those  of  the  QR  algorithm 


are  rather  low. 
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1.  INTRODUCTION 


The  implementations  on  a  parallel  computer  of  three  of  the  most 
commonly  used  techniques  for  finding  the  eigenvalues  and  eigenvectors  of 
matrices  are  discussed.  These  are  Jacobi's,  Householder’s  and  the  QR  method. 
Jacobi's  algorithm  for  real  symmetric  matrices  was  modified  in  order  to  take 
advantage  of  parallelism  in  such  a  way  that  instead  of  two,  only  one  ortho¬ 
gonal  transformation  would  eliminate  n  off-diagonal  elements,  where  n  is  the 
size  of  the  matrix.  Storage  schemes  for  all  three  methods  on  the  parallel 
computer  are  discussed  in  detail.  It  is  well  known  that  Jacobi's  algorithm 
proved  to  be  highly  efficient  for  parallel  computation;  yet,  its  use  is 
recommended  primarily  when  evaluation  of  all  the  eigenvalues  is  required. 

Even  then,  if  high  accuracy  is  also  desired,  it  should  be  used  in  conjunction 
with  a  scheme  for  establishing  machine  bounds  for  the  eigensystem. 

Householder's  algorithm  for  tri diagonal! zation  of  a  real  symmetric 
matrix  has  also  proven  to  be  very  efficient  on  a  parallel  computer;  a  parallel 
modification  of  the  method  of  bisection  can  be  used  on  the  resulting  tri¬ 
diagonal  matrix  to  obtain  one  or  two  eigenvalues  or  to  investigate  the  distri¬ 
bution  of  all  of  them.  If,  however,  all  the  eigenvalues  are  required,  the  QR 
method  is  usually  preferred.  The  QR  algorithm  is  discussed  here  only  for  non- 
symmetric  matrices.  Implementation  of  this  method  on  the  parallel  machine 
proved  to  be  less  efficient  than  the  other  two,  though  it  is  the  most  reliable 
for  obtaining  the  eigenvalues  of  any  matrix.  For  Householder’s  or  the  QR 
algorithm,  the  eigenvectors  of  the  tridiagonal  or  the  upper  Hessenberg  mat¬ 
rices  may  be  obtained  by  the  method  of  inverse  iteration.  If  the  transfor¬ 
mation  matrices  used  in  reducing  the  matrix  to  the  condensed  form  are  stored, 
the  eigenvectors  of  the  original  matrix  can  be  readily  computed. 


-1- 


2.  ILLIAC  IV  OVERVIEW 


2.1  System  Organization 

ILLIAC  IV  is  an  array  of  coupled  computers  driven  by  instructions 
from  a  common  control  unit.  Each  of  the  processing  elements  (PE's)  has  2048 
words  of  64-bit  memory  with  a  cycle  time  under  350  nanoseconds.  Each  PE  is 
capable  of  64-bit  floating  point  multiplication  in  about  500  nanoseconds  and 
addition  in  about  350  nanoseconds.  Two  32 -bit  floating  point  operations  may 
be  performed  in  each  PE  in  approximately  the  same  time.  The  PE  instruction 
set  is  similar  to  that  of  conventional  machines,  with  two  exceptions.  First, 
the  PE's  are  capable  of  communicating  data  to  four  neighboring  PE's  by  means 
of  routing  instructions.  Second,  the  PE's  are  able  to  set  their  own  mode 
registers  to  effectively  disable  or  enable  themselves. 

Figure  1  shows  64  PE's,  each  having  three  arithmetic  registers 
(A,  B,  and  C)  and  one  protected  programmer  register  (S).  The  registers,  words, 
and  paths  in  Figure  1  are  all  64  bits  wide,  except  the  PE  index  registers  (XR), 
mode  registers,  and  as  noted.  The  mode  register  may  be  regarded  as  one  bit 
which  may  be  used  to  block  the  participation  of  its  PE  in  any  action.  The 
routing  registers  are  shown  connected  to  neighbors  at  distances  of  +  1  and  +  8; 
similar  end-around  connections  are  provided  at  1,  64,  etc.  Programs  and  data 
are  stored  in  PE  memory.  Instructions  are  fetched  by  the  control  unit  (CU)  as 
required  from  PE  memory. 

Figure  1  also  shows  the  essential  registers  and  paths  in  the  CU  and 
their  relations  to  the  PE's.  Instructions  are  decoded  and  control  signals  are 
sent  to  the  PE  array  from  the  control  unit.  Some  instructions  are  executed 
directly  in  the  CU,  e.g.,  the  loading  of  CU  accumulator  registers  (CAR's)  with 
program  literals.  Operand  addresses  may  be  indexed  -once  in  the  CU  by  a  CAR  and 
again  separately  in  each  PE  by  XR.  It  is  possible  to  load  the  local  data 
buffer  (64  words  of  64  bits  each)  and  CAR's  from  PE  memory.  Local  data  buffer 
registers  and  CAR's  may  be  loaded  from  each  other.  A  broadcast  instruction 
allows  the  contents  of  a  CAR  to  be  transmitted  simultaneously  to  all  PE's.  It 


"'"A  64-PE  ILLIAC  IV  system  is  expected  to  be  operating  soon.  For  a  more  com¬ 
plete  description  of  the  system  see  [1]. 


FIGURE  1 
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is  often  convenient  to  manipulate  all  PE  mode  bits  or  a  number  from  one  PE  in 
a  CAR.  For  this  purpose,  the  broadcast  path  is  bi-directional. 

The  complete  ILLIAC  TV  system  includes  a  Burroughs  6500  which  per¬ 
forms  input/output  operations,  compilation,  and  contains  an  operating  system 

Q 

to  control  ILLIAC  IV.  There  is  a  10  bit,  head-per-track  disk  with  a  40 

millisecond  rotation  speed  and  an  effective  transfer  rate  of  10^  bits/second. 

This  allows  the  loading  of  the  8  x  10^  bit  ILLIAC  IV  memory  from  the  disk  in 

32  milliseconds.  The  input/output  controller  (IOC)  contains  a  queuer  for 
18 

disk  requests  and  2  bit  buffer  memory  to  smooth  transmissions  to  and 
from  the  slower  B6500  memory,  or  an  external  input/output  device. 

2.2  Programming  ILLIAC  IV 

It  is  apparent  that  ILLIAC  IV  may  be  suited  to  the  execution  of 
algorithms  defined  on  arrays  of  data  (e.g.,  matrix  operations  and  certain 
partial  differential  equations).  However,  there  are  two  major  difficulties 
in  using  such  a  machine.  One  is  that  an  algoritnm  must  be  devised  which 
allows  identical  computations  to  be  performed  simultaneously  on  a  large  num¬ 
ber  of  operands.  The  other,  closely  related  to  the  first,  is  that  the  data 
must  be  so  placed  in  the  ILLIAC  IV  memory  that  over  the  course  of  the  calcu¬ 
lation  few  PE's  are  disabled.  In  other  words,  the  algorithm  and  the  data 
storage  allocation  must  be  mutually  designed  for  highly  parallel  computation. 
For  this  reason,  ILLIAC  IV  codes  are  expressed  as  two  parts,  a  storage  allo¬ 
cation  block,  followed  by  a  computational  algorithm  block.  These  codes  may 
be  written  in  a  high  level  language.  See,  for  example,  [2]. 
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3-  JACOBI'S  METHOD 


3. 1  Theoretical  Background 

The  classical  Jacobi  method  reduces  a  symmetric  matrix  to  the 
diagonal  form  by  a  series  of  orthogonal  transformations,  [3], 


.  ^ 

,  =  ffi  A  cp 
hi  r 


Each  transformation  eliminates  two  off-diagonal  elements.  It  is  possible, 
however,  to  eliminate  n  off-diagonal  elements  by  one  orthogonal  transforma¬ 
tion,  where  n  is  the  order  of  the  matrix  A.  This  can  be  achieved  if  the 
transformation  matrices  <pr  are  of  the  form. 


assuming  that  n  is  even,  and  where, 
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if  the  elements  of  the  transformation  submatrix  are  chosen  such  that. 
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t  I  - 

i  0 

■where  ¥  -  ~b - «*  "f  '  ,  Wt  =  1,  [I]  being  the  identity  matrix,  and 

- 1--, 

A  !  °_ 


ej  =  (1,0),  ej  =  (0,1). 


In  effect,  this  permutation  shifts  the  second  row  to  the  bottom  and  the 

I 

second  column  to  the  far  right.  The  matrix  A  is  then  ready  for  another 


transformation  A 


«  ,  A  ,  ®  It  can  easily  he  shown  that  all  the 

vr+l  r+1  ■  r+1 


off-diagonal  elements  will  he  exposed  to  this  process  of  elimination  after 
every  (n  -  l)  orthogonal  transformation,  if  the  first  row  and  first  column 
are  shifted  to  the  bottom  and  far  right,  respectively.  This  second  kind  of 
shifting  is  equivalent  to  the  permutation 

A  =  T  A  T*  (5) 

n  n 


where  r  = 


i  ,  rr^  =  i. 


To  check  for  convergence  the  sum,  E,  of  squares  of  the  off-diagonal 
elements  is  compared  to  the  sum,  D,  of  squares  of  the  diagonal  elements 
after  every  orthogonal  transformation.  If  the  ratio  q  =  E/D  is  less  than 
an  arbitrarily  small  number  £,  then  the  diagonal  elements  of  the  matrix  Affi 

are  assumed  to  have  converged  to  the  eigenvalues  of  the  original  matrix  A^. 

In  general,  each  diagonal  element  /a^^  will  lie  in  an  interval  of  width 

"  in 


3/e.  The  practically  diagonal  matrix  A^  will  be  of  the  form: 
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where 


W 


(Vi 


r  <p 


n-l 


¥  »2 


t 


is  the  matrix  whose  columns  are  the- eigenvectors. 


3- 2  Implementation 

Consider  the  implementation  of  Jacobi ' s  method  on  a  16-PE  machine 
using  the  storage  allocation  scheme  of  Figure  2  for  an  upper  triangular 
16  x  16  matrix.  The  mapping  of  element  a. .  into  PE  memory  follows,  where 
P£  (assumed  even)  is  the  number  of  PE's  being  used  ([xj  is  the  greatest 
integer  <  x) : 


This  storage  allocation  scheme  is  used  because  of  the  negligible  amount 
of  routing  involved  in  implementing  the  Jacobi  process.  A  step-by-step 
discussion  of  the  process  follows. 


First,  consider  the  calculation  of  the  transformation  matrices  of 
Equation  1  according  to  Equations  2  and  3-  The  transformation  matrices  may 
be  computed  in  the  pirs  of  PE's  shown  in  Figure  3.  Proper  a^  elements  may 
be  accessed,  as  shown  in  Figure  3,  in  three  memory  cycles  plus  fewer  than  ten 
unit  routes  (distance  »  or  4)  under  mode  control.  This  entire  time  is  negli- 
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FIGURE  2 
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gible  with  respect  to  the  times  for  the  calculations  which  follows. 

The  computations  of  Equations  2  and  3  are  carried  out  in  the  pairs 
of  PE's  shown  in  Figure  3-  The  time  required  for  these  is  less  than  15 
microseconds. 


The  major  part  of  the  computation  time  is  consumed  in  carrying  out 

the  transformations  indicated  by  Equation  0.  This  computation  involves  a 

number  of  2  x  2  matrix  multiplications,  and  it  is  important  to  see  that  these 

may  be  performed  efficiently  using  the  storage  allocation  scheme  of  Figure  2. 

Notice  that  for  0  <  i  <  7,  a. .  and  a.  .  -  are,  in  blocks  of  four,  four  PE's 
“  -  ij  i,J+l 

apart,  arid  can  be  brought  to  the  same  PE  in  one  unit  route.  Across  the  boun¬ 


daries  of  the  blocks  of  four,  elements  are  distance  five  apart, 
for  0  <  i 


Similarly, 


__  6,  a  and  ^  are  five  PE's  apart  and  can  be  brought  to  the 

same  PE  in  two  unit  routes.  For  8  <  i  <  15,  a. .  and  a.  .  are  five  PE's 

-  -  ij  i,j+l 

apart  while  for  8  <  i  <  14,  a. .  and  a  .  are  four  PE's  apart  in  blocks  of 

™ ’  ij  i+1 > J 

four  with  five  PE  distance  at  the  block  boundaries.  The  distances  between 
and  ag^.  are  irregular  but  these  rows  do  not  interact  while  so  labeled. 

By  moving  the  computed  transformation  matrices  TQ,  .  .  .  ,  T^to 

the  control  unit,  pairs  (Tq,  T1),  (Tg,  T^) ,  etc.  can  be  broadcast  to  the 

appropriate  PE's.  Figure  4  shows  the  first  two  steps  of  this  process.  Line 

1  shows  the  subscripts  of  the  a„  which  may  be  accessed  in  one  memory  cycle 

and  line  2  shows  the  subscripts  of  the  A. .  partitions  to  which  the  a.  .  of 

IJ  1 J 

line  1  belong.  By  breadcasting  TQ  and  T^  from  the  control  unit,  T0A00T0 ’ 
^C^Ol^l’  ^1^11^1  8X6  comPu^et^  parallel.  Lines  3  and  4  show  similar 
subscripts  for  the  second  step  of  the  processs,  during  which  TpA^Tg,  TQA0gTg, 

T1A12T2’  811(1  T1A13T3  are  comPu"ted- 


Since  TQ  and  1^  were  broadcast  in  step  1, 


they  are  still  available  in  step  2.  The  2x2  matrix  multiplications  are 
facilitated  by  spacing  the  adjacent  row  and  column  elements  one  or  two  unit 
routes  apart.  On  the  next  step  AQi+,  AQ^,  A^  and  A^  will  be  transformed. 
This  processing  is  continued  by  rows  until  the  entire  matrix  has  been  trans¬ 
formed.  Notice  that  at  the  main  diagonal  this  procedure  suffers  from  an 
efficiency  of  approximately  50$  if  computed  in  the  straightforward  way  shown 
above;  this  point  will  be  discussed  later. 

The  total  time  to  carry  out  one  transformation  on  a  64  x  64  upper 
triangular  matrix  is  approximately  6  milliseconds  using  64  PE's.  The  time 
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to  transform  a  square  64  x  64  matrix  is  less  than  12  milliseconds. 

Next,  the  rows  as  shown  in  Equation  4  are  permuted.  The  storage 

scheme  of  Figure  2  allows  this  to  be  done  by  moving  relatively  few  elements 

of  the  matrix.  Figure  5  shows  the  resulting  matrix  in  which  rows  0,  1,  and 

8  and  columns  4,  8,  and  12  have  been  moved.  Each  column  is  moved  left  one, 

row  0  is  moved  right  three  unit  routes,  and  rows  1  and  8  are  moved  somewhat 

irregularly.  However,  the  total  amount  of  routing  time  involved  is  negligible 

with  respect  to  the  transformation  time.  In  general,  three  rows  and\TP_  -  1 

E 

columns  will  be  moved  in  a  similar  way. 

Figure  6  is  a  relabeled  version  of  Figure  5,  showing  the  new  names 
of  the  shuffled  elements.  Note  by  comparison  with  Figure  2  that  the  origin 
has  moved  nine  PE's  to  the  right  and  one  row  down  in  PE  memory.  This  is  a 
matter  which  is  easy  to  accommodate  in  a  program  for  carrying  out  the  Jacobi 
process.  At  this  point  the  entire  process  is  repeated,  started  by  computing 
a  new  set  of  transformations.  After  n  -  1  such  steps  the  second  shuffle  is 
performed  according  to  Equation  5- 

The  second  shuffle  requires  less  routing  than  the  first.  Row  0's 
routing  is  similar  to  the  routing  of  row  1  in  the  above  discussion,  with 
columns  4,  8,  and  12  pushed  back  distance  one  to  accommodate  the  new  column. 

3*3  Efficiency 

First,  an  efficiency  calculation  for  the  simple  process  described 

above  shall  be  performed,  operating  on  a  (kvP_  x  k  sTp  )  matrix  using  P  PE's. 

E  E  E 

k-i. 

Assuming  one  time  unit  to  operate  on  a  (•/]?„  xnTp_)  partition,  £  =  k(k  -  l)/2 

E  E  .i-l 

time  steps  is  used  for  square  partitions  plus  k  time  steps  for  triangular  par¬ 
titions  on  the  main  diagonal.  If  the  elements  which  are  to  be  made  zero  are 

k 

not  computed  (just  set  to  zero),  only  ^  steps  should  be  used  to  transform  the 
triangular  partitions  on  the  main  diagonal.  Thus,  there  is  an  efficiency  of 

k(k  -  l)/2  +  k/2  k  . 

k(k  -  l)/2  +  k  '  k  +  1 

For  a  64  x  64  matrix  and  64  PE  s,  (i.e.,  k  =  8  and'/l’  =  8),  there  is  an 

O  E 

efficiency  of  ^  s  89$. 
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This  same  technique  and  storage  allocation  scheme  may  he  used  on 

matrices  of  dimensions  less  than  the  number  of  PE's.  The  discussion  holds 

of  the  transformation  of  AQ2,  AQ^,  A^g,  and  A^  operated  on  a*ZPE 

partition  of  a  larger  matrix.  The  process  may  be  used  repeatedly  on  a  large 

matrix  or  just  once  on  aZ"PE  x  sTr^  matrix,  full  efficiency  can  be  achieved 

on  square  matrices  of  dimension  k'/"p,,  x  k-/p  . 

h  b 

The  maximum  efficiency  which  can  be  achieved  for  the  Jacobi  process 
on  a  parallel  machine  depends  on  how  the  diagonal  partitions  and  any  columns 
in  addition  to  k  JrP  are  treated.  Suppose  an  attempt  is  made  to  transform 
simultaneously  A^.  AQ1,  A^,  A^,  A^,  and  k  of  the  example,  and  to  sub¬ 
sequently  repeat  this  pattern  on  the  main  diagonal.  This  procedure  will 
achieve  100%  efficiency  during  the  transformation  of  the  triangular  matrices, 
but  an  irregular  route  of  two  elements  will  be  incurred  since  and  a.|  1  1 1 
are  in  the  same  PE  as  are  and  ^ . 

Matrices  of  dimension  larger  than  64  x  64  may  be  partitioned  into 
a  number  of  blocks  of  dimension  64  x  64  or  smaller.  For  example,  a  100  x  100 
matrix  becomes  a  64  x  64  upper  triangular  partition,  a  36  x  36  upper  trian¬ 
gular  partition  and  a  64  x  36  rectangular  partition.  All  rows  of  lectangular 
partitions  are  stored  according  to  the  mapping  given  earlier  for  rows  between 
0  and  Pg/2  -  1.  in  other  words,  the  lower  part  of  the  matrix  is  not  packed 
into  holes  in  the  upper  part,  but  is  stored  in  rows  just  like  the  upper  part. 
In  applying  the  Jacobi  process  to  such  a  rectangular  matrix,  proceed  as  above 
transforming  blocks  of  dimension  'J~P„  x  sT P„  and  achieving  full  efficiency  on 
partitions  of  dimension  pv^P  x  q>Tp^.  If  the  dimension  of  the  matrix  is  not 

iw  Cj 

a  multiple  of \T p„,  the  leftover  columns  may  be  handled  with  100%  efficiency 
if  the  transformations  are  treated  as  column  operations.  Entire  columns  are 
available  in  one  memory  cycle  using  the  storage  allocation  scheme  described 
above . 


Thus  it  seems  reasonable  to  conclude  that  efficiencies  of  90%  or 
greater  may  be  attained  for  any  size  matrix  without  inordinate  programming 
difficulties. 
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3 • 4  Disk  Considerations 


Large  matrices  that  do  not  fit  into  primary  memory  are 
partitioned  into  blocks  of  64  x  64.  The  number  of  rows  of  blocks  is 


LN  -  1 


where  N  >  64  is  the  size  of  the  matrix.  Let  the  number 


of  any  block  b  be  given  by 


b  =  R  +  Ci;1  k  (C,R  =  1,  2,  ...  ,  NR) 

k=0 

where  R  is  the  row  number  of  the  block  and  C  is  the  column  number  of  the 

block.  The  blocks  R-l 

b  =  R  +  £  k,  R  =  (1,  2,  •  •  •  ,  NR), 

k=0 

are  brought  in  from  the  disk  sequentially  to  be  subjected  to  an  orthogonal 
transformation,  Equation  0.  Thus  all  the  rotation  angles  are  determined  and 

the  rest  of  the  blocks,  C-l 

b  =  R  +  £  k,  K  =  (0,  1,  .  .  .  ,  NR)  C  /  R 

k=0 

are  fetched  from  the  disk  to  complete  the  given  transformation  of  the 
whole  matrix.  Some  of  the  rows  and  columns  of  each  submatrix  are  stored 
in  arrays  in  primary  memory,  for  purposes  of  the  shifting  process  that 
takes  place  after  the  whole  matrix  is  subjected  to  a  transformation.  There¬ 
fore,  the  following  elements  will  be  stored  in  primary  memory  for  shifting 
the  second  row  and  second  column: 


Submatrix 


Elements  to  be  stored 


1.  b  =  R  + 


R-l 
£  k 
k=0 


(P  >  1) 


first  row  of  the  submatrix,  i.e., 
the  elements  a. .. 

i  =  64(R  -  1) 

j  =  64(R  -  1),  .  .  .  ,  64  R  -  1 


The  results  of  this  section  are  valid  for  various  secondary  storage 
devices,  e.g.,  drum,  delay  line  or  the  head  per  track  disk  which  we 
discuss. 
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C-l 

2.  b  =  R  +  £  k 

k=0 

(R  >  1,  C  >  R) 


C-l 

3.  b  =  1  +  Z  k 

k=0 

C  >  1 


4.  b  =  1 


first  row  and  first  column  of  the 
submatrix,  i.e.,  the  elements 

(\|)  and  (ars)  ' 
k  =  64(r  -  1) 

?.  =  64(c  -  1),  .  .  ,  ,  64  c  -  1 
s  =  64(c  -  1) 

r  =  64(r  -  1),  .  .  .  ,  64  r  -  1 

second  row  and  first  column  of  each 
submatrix,  i.e.,  the  element 

(a  ]  and  la,  1  . 

pqj  \  tu ) 

p  =  1 

q  =  64(C  -  1),  .  .  .  ,  64  C  -  1 
u  =  64(C  -  1) 

t  =  0,  .  .  .  ,  63 

second  row  and  second  column  of  the 
submatrix,  i.e.,  the  elements 

(a  I  and  fa  )  , 
vw/  \  xy/ 

v  =  1 

w  =  1,  .  .  .  ,  63 

x  =  0 

y  =  1 


For  the  first  row  and  first  column  shifting,  the  elements  to  be  stored 
in  the  core  are  the  same  as  above  except  those  of  items  (3)  and  (4): 


1 


I. 

t 

[ 

L 

L 

! 


c  >  1 


4  .  b  =  1 


Submatrix 

Elements  to  be  stored 

C-l 

.  +  Z  k 

k=0 

first  row  and  column  of  each  sub¬ 
matrix,  i.e.,  the  element  fa  1  , 

l  pqf 

(a  ) 

l  tu)  ' 

P  =  o 

q  =  64(c  -  1),  .  .  .  ,  64  C  -  1 
u  =  64(c  -  l) 
t  =  0,  .  .  .  ,  63 

first  row  only,  i.e.,  the  elements 

(avw)  ’ 

V  =  0 

w  =  0,  .  .  .  ,  63 

The  total  number  of  elements  to  be  stored  in  primary  memory  in  each 
kind  of  shifting  is  N(NB  +  1). 

The  primary  memory  of  64  PE’s  is  capable  of  holding  32  partitions 
64  x  64j  the  transformation  time  for  a  square  block  is  about  12  ms,  and  the 
rotational  delay  of  the  head-per-track  disk  is  40  ms.  If  the  partitions  on 
the  yffsk  are  laid  out  in  such  a  way  that  four  of  them  are  accessed  at  once, 
then  48  ms  later  the  four  64  x  64  partitions  will  have  been  transformed  and 
the  disk  will  have  rotated  more  than  one  complete  revolution.  Thus,  if  the 
requests  for  four  partitions  in  the  disk  queuer  are  kept  one  rotation  ahead, 
the  rotational  latency  of  the  disk  can  be  masked  in  computing  on  large  matrices. 
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__T^r 


. 


4.  HOUSEHOLDER’S  METHOD 


4.1  Theoretical  Background 

This  method  [3)4]  reduces  a  symmetric  matrix 
A  =  [a. .]  (i,j  =  0, 1, 2, . . . , n-1)  to  the  tridiagonal  form  T  using  elementary 
Hermitian  orthogonal  matrices.  There  are  (n-2)  steps  in  this  reduction. 

We  define  the  matrices  by  the  relations, 

Ar  =  Pr  Ar_x  Py  (r  =  1,2,..., n-2)  (6) 


where  the  elements  are  those  of  Ay  ^ .  The  signs  in  both  expressions 
(8)  and  (10)  are  chosen  such  that, 


a  .  +  S 

r-l,r  —  r' 


•ar-l,rl 


Therefore,  in  the  r-th  step  (Ar  ~  pr  Ar  p  P  )  [n  -  (r  +  l)]  zeros  are  in¬ 
troduced  in  the  r-th  column  and  row  of  A r  ^  without  destroying  the  zeros 
introduced  in  the  previous  step. 

In  order  to  take  advantage  of  symmetry,  the  matrices  A may  be 
defined  by  the  relations  [3], 
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(12) 


(13) 


(1*0 


The  total  number  of  multiplications  required  to  reduce  the  matrix  to  the 

3 

tridiagonal  form  is  essentially  2  n  . 

3 

The  eigenvalues  of  the  tridiagonal  matrix  may  be  obtained  by 
several  methods.  For  this  parallel  computer  it  is  proposed  that  if  all  the 
eigenvalues  are  required,  then  the  algorithm  -with  origin  shift  [3,5,6]  is 
used,  (this  algorithm  is  discussed  for  real  nonsymmetric  matrices  in  the 
following  section).  However,  if  one  is  interested  in  the  general  distribu¬ 
tion  of  the  eigenvalues  and  not  in  their  accurate  values,  or  the  eigenvalues 
in  a  given  interval,  the  bisection  method  [3,7]  inay  he  used.  In  order  to 
show  how  the  method  of  bisection  would  be  used  for  a  parallel  computer, 
assume  that  it  is  necessary  to  obtain  all  the  eigenvalues  of  the  resulting 
tridiagonal  matrix  T  using  such  a  method.  For  the  sake  of  illustration, 
assume  also  that  all  the  subdiagonal  elements  are  non-zero  and  hence  the 
matrix  has  simple  eigenvalues.  The  eigenvalues  lie  in  the  interval  [-b,b], 

where  b  =  ||  T  ||  =  max  E  |  t,  . | .  This  interval  is  then  divided  into 

"  i  i  13 

63  subintervals  and  for  all  (k  =  l,a, . . . ,6k),  where  p^  =  -b,  and  p^  *  +b, 

the  quantities. 
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(15) 


fo(V  "  1 

flK}  =  too  - 

fi(wk}  =  (ti-l,i-l  -  \]  -  *1-2,1-!  fi-2(^(i  = 


are  evaluated  in  parallel,  one  EE  to  each  /u^.  For  each  the  number  of 
agreements  in  sign  s(p^)  of  consecutive  members  of  the  sequence  f ^p^), 
fl(Mj_),  is  the  number  of  eigenvalues  larger  than  p^.  The 

process  is  repeated  for  those  intervals  [p^>  that  contain  more  than 

one  eigenvalue  until  all  the  eigenvalues  are  separated.  Assuming  that  the 
matrix  is  of  order  64,  6k  intervals,  each  containing  an  eigenvalue  are 
established.  By  assigning  each  interval  to  a  PE  and  by  successive  bisection 
of  each  interval,  using  the  Sturm  sequence  property,  all  the  eigenvalues 
are  obtained. 


Once  all  the  eigenvalues  are  available,  the  eigenvectors  of  the 
tridiagonal  matrix  are  obtained  by  the  method  of  inverse  iteration  [3,8]. 
Again  this  can  be  done  in  parallel,  each  PE  evaluating  an  eigenvector. 
Finally  if  Z±  (i  =  1,2, ...,n)  is  ah  eigenvector  of  the  tridiagonal  matrix 
T,  then  the  corresponding  eigenvector  Y^  of  the  original  matrix  A  is  com¬ 
puted  by, 


Y. 

l 


P  P 

1  2 


n-2 


Z. . 

l 


(16) 


4.2  Implementation  of  Householder  Tridiagonalization 

We  shall  consider  the  implementation  of  Householder's  method 
on  a  16-PE  machine  using  the  storage  allocation  scheme  of  Figure  7  for 
an  upper  triangular  16  X  16  matrix.  This  scheme  is  somewhat  less  com¬ 
plicated  than  Figure  2  since  it  is  not  necessary  to  provide  for  shuffling.  In 
particular,  adjacent  row  and  column  elements  are  always  a  unit  route 
from  each  other,  except  between  the  seventh  and  eighth  rows.  The  lower 
half  of  the  matrix  is  stored  in  reverse  order  to  allow  efficient  com¬ 
putation  on  this  part  of  the  matrix.  This  scheme  also  allows  access  to 
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any  row  with  one  memory  fetch.  Operations  may  be  performed  on  PE  X  P£ 
blocks  at  any  step,  where  the  size  of  the  matrix  n  is  equal  to  the  number 


of  lE*s  Pg. 


if  0  <  i  <  PE  -  1, 


The  memory  map  for  this  scheme  consists  of  two  parts: 

E 
2 


a.  .  -»  loc  i 


,  PE  £((i(mod  /PE))  >Tpe  +  j)  (mod  PgjJ  ;  (1?) 


P  P 

if  E  <  i  <  E  -  1 


a.  .  -*  loc  E  -  i,  PE 
ij 


J(P£  -  1)  -  ( ( i (mod  /Pg )) / P£  +  j )  (mod  PE)J  . 


A  step-by-step  discussion  of  the  process  follows. 

As  the  first  step  (r  =  1),  consider  row  0  of  the  matrix,  (aQQ,  aQ^, 
.  .  an  -£.)•  This  row  is  in  location  0,  Figure  7.  The  first  diagonal 
element  of  the  tridiagonal  matrix  T  is  tQQ  =  a^-  The  element  tQ1  takes 
the  value  of  +  determined  by  Equation  9  (with  r  =  1)  and  sign  opposite  to 
that  determined  by  Equation  11.  The  rest  of  the  elements  of  the  first  row  of 
T  are,  of  course,  zero.  The  vector  is  then  readily  constructed  by  Equa¬ 
tions  8,  9,  and  10  in  a  time  which  is  negligible  compared  to  the  following 
calculations.  The  vectors  p^  and  are  then  computed  with  almost  full 
efficiency  using  Equations  13  and  14. 


Once  u^,  q^,  and  the  first  row  of  the  tri-diagonal  matrix  T 
are  constructed,  matrix  Ag  is  obtained  using  Equation  12.  However,  this 
transformation  can  be  applied  on  the  individual  x  v  Pg  submatrices  A^, 
where  B,  C  =  (0,  1,  .  .  .  ,  «/pe  -  l),  thus  dividing  the  vectors  ^  and  q^^ 
into  nTp  subvectors  each  of  */" P£  components.  The  submatrices  A^  at  any 
step  r  are  given  by, 


<v,  ■  -  48)  4C)  -  4R)  4C) 


(18) 


for  all  R  and  C  such  that  R  <  C. 
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It  may  be  noted  from  Figure  7  that  for  the  steps  r  =  (l,  2,  .  .  .  , 
sTPg) ,  the  efficiency  of  parallel  computation  on  the  submatrices  AQ1,  AQ2, 

and  AQy  Figure  8,  descreases  as  r  increases,  while  that  of  the  remaining  off- 
diagonal  submatrices  A ^  ,  and  A^  is  100$.  Moi cover,  efficiencies  of 

higher  than  50$  can  be  obtained  for  the  diagonal  submatrices  A^,  A^g,  and  A^* 
The  entire  time  required  for  reducing  a  64  x  64  symmetric  matrix  to  the  tri¬ 
diagonal  form  on  a  64- PE  machine  is  less  than  4  ms. 


4. 3  Efficiency  of  the  Tridiagonalization 

The  efficiency  of  the  Householder  processs  on  a  symmetric  P_  x  P_ 

£  £ 

matrix  is  computed  using  the  storage  allocation  scheme  of  Figure  J.  Assume 
that  computations  are  performed  on  sT P_  x  nTp„  blocks  and  that  no  attempt  is 

£  Ei 

made  to  mask  inefficiencies  incurred  due  to  the  annihilation  of  rows  and 

columns  within  particular  ■/"p  x»/p„  blocks.  It  is  assumed  that  diagonal 

£  £ 

blocks  are  transformed  in  pairs  and  that  no  inefficiency  is  incurred  when 
there  is  an  even  number  of  such  blocks.  In  Figure  7  it  may  be  noted  that  for 
example,  the  upper  left  and  lower  right  triangular  blocks  may  be  accessed  and 
transformed  simultaneously,  i.e.,  the  elements  with  subscripts  (0,0;  0,1;  0,2; 

0,3:  15,15;  1,1;  1,2;  1,3;  14,15;  l4,l4;  2,2;  2,3;  13,15;  13,14;  13,13;  3,3). 

This  pairing  can  be  observed  along  the  entire  diagonal.  It  is  also  possible 
to  overlap  parts  of  square  blocks  which  have  been  annihilated.  For  example, 
when  *T Pg/2  rows  in  a  strip  of  block  are  set  to  zero,  adjacent  blocks  could  be 
paired  and  100$  efficiency  regained.  Similarly  after  sf Pg/4,  sT Pg/8,  .  .  .  , 
etc.  rows  are  annihilated,  full  efficiency  could  be  regained.  This  latter 
matter  requires  some  delicate  discussion  as  well  as  programming  which  will 
not  be  included  here. 


The  number  of  operations  required  to  carry  out  the  equations  of 
Householder's  process  on  a  triangular  matrix  of  decreasing  dimension  is 


E 

j=l  i=l 


J 

T 


=  §  E  (42  +  j) 
j=l 


p3  op?  2P 

E  +  E  +  E 
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The  number  of  operations  performed  by  a  parallel  maching  using 
the  storage  allocation  scheme  of  Figure  7,  with  the  above  assumptions, 
follows.  There  are  Pg  elements  per  '/Pg  X  block  and  each  block  is 

swept >Tp„  times  per  pass.  This  product  times  the  total  number  of  sf P„  x  nTP- 

blocks  to  be  transformed  in  all  passes  is 

r  r^B-1  J  ^PE 

'/P1pP_  £  Z  i  +  2  I  i 

E  E  j=l  i=l  i=l 


where  the  first  term  in  the  brackets  counts  the  number  of  rectangular 
blocks,  and  the  second  term  counts  the  number  of  paired  triangular  blocks 
on  the  diagonal.  This  is  equal  to 


Thus  for  the  Householder  process  on  symmetric  matrices  we  have: 


processing  efficiency  = 


ops.  required 


ops.  used 


4  *  3tr  *  2 

4  *  1-^’pe  •  pe  *  2pe 


If  n  =  Pg  =  64  this  yields  an  efficiency  of  approximately  86$.  With  the 
introduction  of  the  pairing  of  partially  annihilated  rectangular  blocks 
mentioned  above,  efficiencies  of  well  over  90$  can  be  achieved. 

As  in  our  discussion  of  the  Jacobi  process,  matrices  of  larger 
and  smaller  dimensions  may  be  handled  using  the  same  storage  scheme. 
Matrices  of  dimension  smaller  than  the  number  of  PE's  are  handled  using 
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sf p  x'/p_  blocks.  Those  of  dimension  larger  than  the  machine  size 

£j  ili 

are  partitioned  into  triangular  and  rectangular  blocks.  The  rectangular 

p 

blocks  are  stored  using  the  memory  map  specified  for  rows  0  <  i  <  _E  -  1 

2 

for  all  rows,  since  there  are  no  holes.  In  the  case  of  either  large  or 
small  matrices  of  dimension  other  than  a  multiple  of  '/Pg  the  above  effi¬ 
ciency  formula  is  a  close  approximation  of  the  correct  value. 


4.4  Disk  Considerations 

Large  matrices  that  do  not  fit  the  primary  memory  are  parti¬ 
tioned  into  blocks  of  64  x  64.  The  number  of  rows  of  blocks  is, 

™  J+ 1 

where  N  >  64  is  the  size  of  the  matrix.  The  procedure  is  the  same  as 
above.  At  the  r-th  step,  the  first  row  of  the  lemaining  matrix  is  fetched 
from  the  disk  and  the  vectors  u  ,  pr>  and  q^  are  constructed.  The  64  x  64 
blocks  are  then  fetched  from  the  disk  one  at  a  time  and  transformed  as 
discussed  above.  The  efficiency  of  the  computation  increases  as  the  size 
of  the  matrix  gets  larger. 

The  elements  of  the  resulting  tridiagonal  matrix  are  stored  in 
the  primary  storage.  Because  of  symmetry,  the  number  of  these  elements  is 
only  (2N  -  1).  The  time  taken  for  the  transformation  of  an  off-diagonal 
64  x  64  block.  Equation  18,  on  a  64- PE  machine  is  approximately  6  ms., 
assuming  that  the  vectors  u,  p,  and  q  are  already  available.  Thus  if 
requests  for  eight  64  x  64  partitions  in  the  disk  queuer  are  kept  one 
rotation  ahead,  the  rotational  latency  of  the  disk  would  he  masked. 
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5.  THE  QR  ALGORITHM 


5-1  Theoretical  Background 

Throughout  this  discussion  it  is  assumed  that  the  matrix 
A  =  [a..]  (i,j  =  0, 1, 2, . . .  ,n-l)  under  consideration  is  both  real  and  non- 
singular.  The  QR  transformation  consists  of  forming  a  sequence  of  matrices 
A^  (k  =  1,2,...)  by  the  relation  [5]> 

(l7) 

where  is  an  orthogonal  matrix  such  that 

<  \  -  \  (l8) 

is  an  upper  triangular  matrix.  In  general,  for  large  k  the  matrix  tends 
to  a  form  in  which  all  the  eigenvalues  are  either  isolated  on  the  diagonal 
or  are  the  eigenvalues  of  2  x  2  diagonal  submatrices. 

The  QR  algorithm  is  always  used  after  the  original  matrix  is 
reduced  to  the  upper  Hessenberg  form,  because*. 

(1)  The  upper  Hessenberg  form  is  invariant  under  the  QR  trans¬ 
formation  [5]- 

(2)  The  number  of  multiplications  involved  in  a  QR  transfor- 

2  3 

mation  is  greatly  reduced,  n  after  reduction  vs.  n  for  the  unreduced 
matrix,  where  n  is  the  order  of  the  matrix. 

(3)  The  QR  algorithm  has  strong  convergence  properties  when 
applied  to  upper  Hessenberg  matrices  [9]- 

There  are  several  stable  methods  for  reducing  any  square  matrix  to  upper 
Hessenberg  form  [ 3 ,4] .  The  method  of  Householder,  which  is  described  in 
the  previous  section,  may  be  used  for  that  purpose.  The  reduction  to 
upper  Hessenberg  form  is  therefore  completed  in  (n  -  2)  steps,  the  r-th 
of  which  is  given  by, 

Ar  =  Ar_1  -  vr  p*  -  (qr  -  ar  \)  \  (r  =  1.2,...,  n-2)  (19) 


li 

II 


IJ 

y 
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where  Aq  is  the  original  matrix  A  and, 


ur  -  (0 0,  ar>r_1  +  Sr,  ar+1>rl,. 

S  4"  afV 

r  Ur 


’  ’ an-l, r-1^ 


t  t 

Pr  =  ur  Ar-1 


q  =  A  -  u 
ur  r-1  r 


v  = 
r 


^  i  ar,r-l) 
Qt  =  pr  vr  ' 


(20) 

(21) 

(22) 

(23) 

(24) 

(25) 


The  sign  of  (a  ,  +  S  )  is  chosen  as  indicated  in  Equation  11.  The  total 
r,  J.  “*  r  5  ^ 

number  of  multiplications  involved  in  this  reducation  is  essentially  ^  n  . 

The  QR  algorithm  is  invariably  used  with  origin  shifts  [3,5,6,10] 
described  by, 

4  r>  -  Wi- 4  (26> 


is  the  origin  shift,  and 
A^  is  the  upper  Hessenberg  form  of  the  original  matrix  A.  In  order  to  speed 
up  convergence,  the  origin  shift  ^  is  chosen  such  that  it  ultimately  ap¬ 
proaches  an  eigenvalue  of  A.  It  often  happens  that  the  matrix  A  has  complex 
conjugate  eigenvalues  which  lead  to  a  complex  value  of  £  •  Francis  [5] 
proposed  the  method  of  double  origin  shift  which  combines  two  QR  iterations 
in  one,  avoiding  complex  arithmetic.  This  method  may  be  described  by  the 
relations, 


where  Q^  is  orthogonal,  R.  is  upper  triangular. 


A  (\  -  Q  r>  (Ak  -  5m  r>  ■  \ 

\+2  =  Wk  V  Wk  (28) 
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where  Q^+1  is  orthogonal,  ^  and  (^+1  are  the  origin  shifts,  and 

is  upper  triangular.  The  origin  shifts  and  are  usually 

taken  as  the  eigenvalues  of  the  last  2x2  diagonal  submatrix,  and  .r>  nee  A 
is  real  they  are  either  both  real  or  complex  conjugate.  Hence  the  matrix 
=  (A^  -  I)  (A^  -  I)  is  real  and  no  complex  arithmetic  is  involved. 

However,  since  is  nonsingular,  A^+g  and  are  determined  by  the  first 
column  of  only,  and  the  determination  is  unique  if  the  subdiagonal  ele¬ 
ments  of  A^_+g  are  positive.  The  first  column  of  is  the  same  as  the  first 
column  of  the  elementary  Hermitian  orthogonal  matrix  Nq  that  eliminates  the 
elements  of  the  first  column  of  the  matrix  below  the  diagonal.  The  first 
column  of  B  has  only  three  non-zero  elements.  They  are  given  by, 

iC  " 

b00  _  a(k)/  (k)_  e  )  +  a(k)  +  T1 

b00  a00  1  00  V  01  10  ^k 

(k)  _  (k),  (k)  a(k)_  )  (29) 

bio  aio  ^  00  ail  V  ' 

b(k)  _  Jk)  (k) 

b20  "  10  21 


where, 


€  -  f  +  t  -  a(k)  +  a(k) 
k  "  S  ^k+l  ”  n-2,n-2  n-l,n-l 

•n  =  t  f  =  a(k)  (k)  _  (k)  a(k) 

'k  ^k  ‘"k+l  n-2,n-2  n-l,n-l  n-2,n-l  n-l,n-2 


Thus  the  matrix  Nq  is  given  by, 

2ynyo 

N  =  I  -  -  P-0-. 

° 


where, 
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0) 


K?)2  *  <>2  * 


(h 


Therefore,  the  matrix  =  NQ  A^  Nq  is  of  the  form. 


(32) 


where  the  underlined  elements  are  those  elements  of  affected  by  the  trans¬ 
formation.  Now  the  matrix  can  be  reduced  to  the  upper  Hessenberg  from 
by  the  orthogonal  transformations, 


A,  0  =  N  _  N  _ 
K+2  n-2  n-3 


...  N?  N*  A.  If  N.  ...  N  ,  N  , 
1  o  k  o  1  n-3  n-2 


where  the  elementary  Hermitian  matrix  is  chosen  to  eliminate  the 
elements  of  the  first  column  of  below  the  subdiagonal.  Therefore, 
Cg  =  will  be  of  the  form 
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Consequently,  Ng  is  chosen  to  eliminate  the  elements  of  the  second  column 
of  Cp  below  the  subdiagonal.,  and  so  forth.  In  general,  the  orthogonal 
matrices  Nr(r  •-  1,2,  .  ..,n-2)  are  given  by, 


N  =  I 
r 


2y  y 

r  r 


(34) 


where. 


/  —  (0, ... .,0,1, s  ,t  ,0,..*..,0) 


r'  r 


in  which  the  first  r  elements  are  zero, 
c(r) 

s  =  r+l,r-l 

r  77 - 7 

c'’  '  +  S 

r,r-l  —  r 

t  .  <&Ui 

r  ew ; :  s 

r,r-l  -  r 


3  -  |7c(r)  )2  +  (c(r)  )2  +  (c(r)  )*H 

3r  “  jj  r,r-l;  '  r+l,r-l;'  1  r+2,r-V  J 


(35) 


(36) 

(37) 

(38) 


After  each  iteration  the  subdiagonal  elements  are  examined  for 
possible  partitioning  into  two  or  more  smaller  upper  Hessenberg  matrices 
and  the  iterations  continued  with  the  bottom  submatrix.  Wilkinson  [ 3]  and 
Martin  [6]  mention  that,  using  the  double  origin  shift  method,  the  average 
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number  of  iterations  per  eigenvalue  for  most  matrices  that  are  encountered 
in  practice  is  less  than  two- 


5-2  Implementat ion 

Figure  8  shows  the  storage  allocation  scheme  for  a  16  x  16  non- 
symmetric  matrix  on  a  l6  FE  machine.  This  storage  scheme  maps  an  element 
a. .  into  the  memory  as  follows: 


EE((i(mod'TpE))  xnTpe 


+  j) 


(mod  P£)  (39) 


where  PE  is  the  number  of  EE's  in  the  machine  and  LxJ  is  the  greatest  in¬ 
teger  less  than  x.  The  implementation  of  the  reduction  of  the  original 
matrix  to  the  upper-Hessenberg  form  and  the  QJR-iterations  is  similar  to  that 
of  the  Householder  tridiagonalization  explained  in  Section  4.2,  and  hence 
will  not  be  discussed  here  in  detail. 

5-3  Efficiency 

As  a  direct  extension  of  the  discussions  in  Sections  4.2  and  4.3 
it  can  be  shown  that  using  Equations  19  -  25  the  reduction  to  the  upper- 
Hessenberg  of  a  64  x  64  matrix  would  take  less  than  10  ms  on  a  64-FE 
machine  with  an  efficiency  higher  than  86%.  It  will  be  shown  that  the  Q.R 
transformations  dominate  the  overall  efficiency  calculations.  Detailed 
examination  of  the  efficiency  of  one  step  in  double  QR-iteration  Cr+-^  = 
follows,  where  is  defined  by  Equations  34  -  38-  By  virtue  of 
Equations  19  -  25,  may  be  written  as, 


C  ,.  =  C  -  v  -  (q_  -01  y  )  v^ 
r+1  r  r  ^r  x  t  r  'r  r 


(40) 
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2 


where  =  y^C  ,  a  =  C  y  ,  v 
r  t  t7  vv  r 


l|yrl 


y  ,  a  =  p  v  ,  and  y  is  defined 
r  r  rrr  r 


by  Equations  35  -  38.  On  v.  parallel  computer  the  efficiency  of  the  process 

used  for  finding  the  square  roots  affects  the  overall  efficiency  of  the 

Q.R- algorithm.  Therefore,  in  order  to  minimize  the  inefficiencies,  we 

describe  the  following  square  root  routine  [11].  Any  positive  floating 

number  Z  may  be  expressed  as  Z  =  2  (x)  where  either  ^  <  x  <  1  or 

1  ^  1  __  .  r„  _m.  r 


^  ^  x  v  — .  Thus  Z  =  2  (v  x)  and  the  problem  is  reduced  to  finding  the 


<  x  < 

square  root  of  x.  We  use  the  recursive  relation. 


I 
I! 

II 


“f.+i 


~ 


(3 


2, 

XU£) 


l  =  0,  1,  2, 


(41) 


where. 


u>0  =  a  +  bx  +  cx 


in  which, 


and, 


a  =  3-16214 
b  =-5.86118 
c  =  4.75000 


a  =  2.22152 
b  =-2.03146 
c  =  0.81250 


for  u  <  x  <  p- 


f  or  i  <  x  <  1 


(42) 


(43) 


(44) 


y 


a.  converges  quadratically  to  1  and  nTx  =  (x)  lim 


II 


;  iv:  - 

■"■Kn.-'M e&b 
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In  figure  9  the  operations  at  a  given  level  in  the  tree  are  identical  and 

shown  to  the  right,  and  the  totals  are  shown  below.  Unlabeled  nodes  are 

assumed  to  use  the  result  computed  immediately  above.  In  this  and  subsequent 

discussions,  the  insignificant  routing  and  broadcasting  times  are  neglected. 

Assuming  that  one  division  is  equivalent  to  five  multiplications,  and  two 

additions  are  equivalent  to  one  multiplication,  the  evaluation  of  the  above 

quantities  on  a  serial  machine  would  effectively  require  35  multiplications. 

On  a  P  >  8  machine  the  same  process  requires  only  the  equivalent  of  20 

multiplications  or  10  u-s.  This  is  clear  from  the  fact  that  the  tree  height 

is  equivalent  to  20  multiplications  and  its  width  does  not  exceed  8.  The 

efficiency  of  the  parallel  computation  of  this  part  is  ■  This  is  a 

20 

rather  conservative  definition  of  efficiency.  It  assumes  that  the  total  time 
’’necessary"  for  a  parallel  machine  to  evaluate  a  function  is  just  the  equiva¬ 
lent  number  of  multiplications  divided  by  the  number  of  PE's  available. 

The  row  vector  p^  =  y^  C  has  its  first  (r  -  l)  elements  equal  to 
r  r 

zero.  Its  formation  requires  the  equivalent  of  3(n  -  r+l)  multiplications. 
The  column  vector  q^  =  y^  has  (n  -  r  -  4)  zero  last  elements  for  r  =  1, 

2,  .  .  .  ,  n  -  5*  Its  construction  requires  the  equivalent  of  (3r  +  10) 
multiplications.  Assuming  that  the  size  of  the  matrix  is  n  =  (m)P£  where  the 
integer  m  >  1,  the  average  number  of  equivalent  multiplications  required 
foi  computing  either  p^  or  q^  on  a  parallel  computer  is  given  by 

T  ?}  O 

-  ,  ^  (m  +  l).  In  order  to  calculate  the  overall  efficiency, 

i  d 

o-i 

observe  that  for  the  matrix  Cy  there  are  m  diagonal  Pg  x  Pg  submatrices; 
for  each,  the  corresponding  portion  of  either  pp  or  q^  requires  three 
equivalent  multiplications  with  an  average  efficiency  of  50$,  and  for  each 
of  the  i  m(m  -  l)  P„  x  P  submatrices  above  the  diagonal,  the  corresponding 

d  Ej  Jj 

portion  of  either  or  q^_  also  requires  three  equivalent  multiplications, 
but  with  an  efficiency  of  100$.  Weighing  the  individual  efficiencies  with 
the  corresponding  number  of  equivalent  multiplications,  the  overall  average 
efficiency  for  computing  either  or  q^  is  calculated  as  follows: 


^r 


(3m)  0.5  +  |  m(m  -  l)  (1.0) 


O 

3m  +  |  m(m 


1) 


m 

m  +  1 


-35- 


Since  Pr  and  q^  contain  some  zero  elements  and  since  the  computation  of 


v  = 
r 


y^  requires  only  two  multiplications  and  that  of  vy 


requires  four  multiplications,  both  v^,  and  cc^  can  be  computed  simultan¬ 
eously  with  pr  and  q^  respectively.  The  computation  of  the  column  vector 
aryr  which  requires  two  multiplications  can  also  be  performed  simultan¬ 
eously  with  the  computation  of  the  matrix  vr  p^-  The  average  number  of 

multiplications  required  for  computing  v^  p^  on  a  parallel  machine  is  the 
same  as  that  of  p^  and  q^  and  with  the  same  efficiency.  The  construction 
of  the  vector  (q^  -  ap  y^)  requires  the  equivalent  of  ^  multiplications 

with  an  efficiency  of  |-=~-\.  The  average  number  of  multiplications  required 

\y) 


for  forming  the  matrix  (q^ 


ar  yr)  on  a  parallel  computer  is  also 
identical  to  that  of  p  or  q  and  has  the  same  efficiency.  The  same  holds 

^  **  t  -h 

for  constructing  the  sum  C  -  v  p  -  (q_  -  cr  y  )  v  .  Neglecting  the  time 

rrrrrrr 

taken  to  find  the  origin  shifts  for  every  iteration,  it  can  be  concluded 
that  the  total  time  required  for  one  double  QR-iteration  on  a  parallel 
computer  of  P  PE's  and  for  a  matrix  of  size  n  =  (m)P„  is  essentially 
((m)PE  -  2) 


[10-75  +  3-75  (m  +  1)3  ;u. s.  with  an  efficiency  of 
0.5  (35/Pe)  +  3-75  m  +  (2.25/P£) 

10.75  +  3-75  (m  +  l) 


Thus,  the  time  required  for  one  QR-iteration  for  a  256  x  256  matrix  on  a 
16 -PE  machine  (assuming  that  the  matrix  can  be  contained  in  the  primary 
memory)  is  19  ms  and  the  efficiency  of  parallel  computation  is  82$.  If 
P£  =  64,  then  the  time  taken  is  7-5  ms  with  an  efficiency  of  52$.  Clearly, 
the  minimum  efficiency  of  the  QR-iteration  occurs  when  m  =  1.  For  example, 
for  n  =  P£  =  64,  the  efficiency  is  22$.  In  order  to  establish  a  rough 
lower  bound  on  the  overall  efficiency  of  the  Q.R  algorithm,  consider  the 
case  for  which  m  =  1  and  when  the  matrix  gets  deflated  by  two  rows  and  two 
columns  every  two  iterations.  This  efficiency  is  given  by. 
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